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Abstract. Approximate query processing (AQP) is an interesting alter¬ 
native for exact query processing. It is a tool for dealing with the huge 
data volumes where response time is more important than perfect accu¬ 
racy (this is typically the case during initial phase of data exploration). 
There are many techniques for AQP, one of them is based on probability 
density functions (PDF). PDFs are typically calculated using nonpara- 
metric data-driven methods. One of the most popular nonparametric 
method is the kernel density estimator (KDE). However, a very serious 
drawback of using KDEs is the large number of calculations required 
to compute them. The shape of final density function is very sensitive 
to an entity called bandwidth or smoothing parameter. Calculating it’s 
optimal value is not a trivial task and in general is very time consum¬ 
ing. In this paper we investigate the possibility of utilizing two SIMD 
architectures: SSE CPU extensions and NVIDIA’s CUDA architecture 
to accelerate finding of the bandwidth. Our experiments show orders 
of magnitude improvements over a simple sequential implementation of 
classical algorithms used for that task. 

Keywords: approximate query processing, graphics processing unit, 
probability density function, nonparametric estimation, kernel estima¬ 
tion, bandwidth selection 


1 Introduction 

The paper is about implementing an approximate query processing (AQP) tech¬ 
nique with the support of two SIMD architectures: SSE extensions of modern 
CPUs and Graphics Processing Units (GPUs). We propose modifications of clas¬ 
sical algorithms which perform parallel and concurrent computations to acceler¬ 
ate very time-consuming operations while calculating the optimal kernel density 
estimators (KDE), which heavily depends on the so called bandwidth or smooth¬ 
ing parameter. These estimators estimate probability density functions (PDF) 
which can be used in the AQP task. 
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The perfect situation is if every query that is sent to the database engine 
could return an exact solution in no more than seconds. However, if the database 
stores really huge amount of data (as it is the most typical in data warehouses, 
DW) such a perfect behavior is not a general rule. A lot of theoretical as well 
as practical works have been done so far in the area of DW optimization [53]. 
The research is usually concentrated on such aspects as designing a dedicated 
logical and physical data structures (novel schemes for indexing, materialized 
views, partitioning, etc.). Relatively less attention is paid for getting approximate 
results. Is seems obvious that if we have the following dilemma “what is better: 
getting the exact solution in 1 hour or getting only approximate solution in 
less then seconds”, we probably lean toward the second possibility. Getting only 
approximate solutions, instead of exact ones, in many practical situations is 
absolutely acceptable. For example, if our goal is to calculate the total gross 
income within the whole fiscal year, the roundoff to full thousands is obviously 
correct approach. There are at least a few different schemes for implementing 
AQP. We supply a brief review of them in section 12.11 

In the paper we concentrate on a technique based on analyzing statistical 
properties of data. We use probability density functions which give an elegant 
and efficient framework for implementing AQP. However, one of the most seri¬ 
ous drawback of this approach is that typical kernel nonparametric algorithms 
used in evaluating optimal PDFs scale quadratically (see section IT31 for precise 
mathematical formulas). If data sizes increase, such kernel methods scale poorly. 
To overcome this problem, two main approaches may be used. In the first one, 
many authors propose various methods of evaluating approximate kernel density 
estimates. We give a short summary of these methods in section [2^ In this paper 
we investigate the second approach where kernel density estimates are evaluating 
exaetly. To make this approach practically usable, all the time consuming com¬ 
putations are performed using two modern SIMD architectures. We describe 
two implementations of each of three different bandwidth finding algorithms. 
The first implementation (later called SSE implementation) utilizes two benefits 
of modern CPUs: SSE extensions allowing to perform the same instructions on 
multiple values in parallel as well as multicore architecture allowing to process 
several threads in parallel on several distinct cores. The second implementa¬ 
tion (later called GPU implementation) utilizes NVIDIA’s CUDA architecture 
to start multiple threads and perform as many computations concurently or in 
parallel as possible. The speedups we obtain are in the range of one to two orders 
of magnitude in comparison with classical sequential CPU-based implementation 
(later called Sequential implementation). 

The remainder of the paper is organized as follows. In section [5| we cover the 
necessary background material on APQ, computation of PDFs and bandwidth 
selection. In Section[3|we supply a brief review of modern SIMD architectures. In 
Section|4|we turn our attention to give the reader some preliminary informations 
on PDFs, kernel estimators of PDFs, using PDFs in database area. We also 
give the very detail mathematical formulas for calculating optimal bandwidth 
parameters using three different methods. We also propose some modifications of 
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the basic formulas to improve calculations performance. In section [5] we cover all 
the necessary details on parallel algorithms for bandwidth selection. In section [5] 
we show how to utilize the algorithms presented in section[5] In section[7]we give 
details on hardware and software used in experiments performed and present the 
results of the experiments (presented as speedups over different implementations, 
that is SSE, GPU and sequential ones). In section [5] we conclude our paper. In 
appendix|2]we give the details on how we derived some important formulas later 
used in our algorithms. 


2 Related works 

2.1 Approximate query processing 

Approximate query processing can be done in many different schemes m- They 
all assume applying a kind of preliminary data reduction which gives as a result 
a synopsis of the original input data. All the following data queries operate 
on this synopsis instead of the original data. Probably the simplest method of 
obtaining synopses is sampling. In this case we believe that data after sampling 
remain still sufficiently representative. This method is often called numerosity 
reduction M- There is also a complementary to numerosity reduction method 
called dimensionality reduction, but this technique is rather not used in the 
AQP area ITTTO] . Note also that many commercial RDBMS use sampling in the 
process of determining the best query execution plans. The synopsis can be built 
using such techniques as: histograms [TH], wavelets [33] or investigating statistical 
properties of the data [29] . The latter approach is investigated in the paper. 


2.2 Computation of probability density functions 

The probability density function (also called probability distribution, density 
function or simply density) is one of the most important and fundamental con¬ 
cept in statistics and it is widely used in exploratory data analysis. Fast calcula¬ 
tion of the optimal PDF is still an interesting scientific problem. There exist a lot 
of parametric forms of density functions, that is if their shapes can be described 
by a mathematical formula. A very complete list of probability density functions 
can be found for example in [33131] . On the other hand, if the parametric form 
of the PDF is unknown (or difficult to calculate) one should consider usage of 
nonparametric methods. 

The task of the PDF estimation is to compute an estimate / based on the 
given n sample points drawn from an unknown population of data with density /. 
One of the most popular nonparametric method is the kernel density estimator, 
see for example [35] for a complete review. 

There are two main computational problems related to KDE. Firstly, the 
calculation of the estimate f{x,h) (or f{x,H), see chanter 14.21 for shortened 
explanation on differences beetwen h and H) and secondly, estimation of the op¬ 
timal (in some sense) bandwidth parameter h (or H). A plethora of techniques 
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have been proposed for accelerating computational times of the first problem. 
The naive direct evaluation of the KDE at m evaluation points for n source 
points requires 0(mn) kernel evaluations. Evaluation points can be of course 
the same as source points and then the computational complexity is O(n^). The 
most commonly used method to reduce the computational burden for KDE is 
to use a technique known as binning or discretising. In such a case, the density 
estimation is evaluated at grid points rather than source points. The idea relies 
on generating grid points (not necessarily equally spaced) of the size g, where 
g and gi < ■ ■ ■ < gm- Then the original xi, - ■ ■ Xn source points are replaced 
by grid counts ci, ■ ■ ■ Cg, where the value of Ci depends on the ’’mass of the data” 
near gi. Binning strategy reduces the required kernel evaluations to 0{mg). Fur¬ 
thermore, if the evaluation points are the same as grid points, the further kernel 
evaluation from 0{g^) to 0{g) is possible (as certain evaluations use the same 
arguments and don’t need to be calculated again). Another approach toward 
saving computational complexity is based on using Fast Fourier Transformation 
(EFT), first proposed in [32]. Using EFT requires that the source points are on 
an evenly spaced grid and then one can evaluate KDE at an evenly spaced grid 
in 0(nlogn)- If the source points are irregularly spaced the pre-binning strategy 
described above should be applied first. The resulting KDE is also evaluated 
at regular evaluation points. If irregular target points are required, a sort of 
interpolation based on neighboring evaluation points should be applied. In the 
FFT-based approach however there is a potential setback connected with an 
aliasing effect which is not completely bening. This problem is investigated in 
details in m- Another technique which reduces the computational complexity 
is based on Fast Gauss Transform (FGT) introduced in [13] and can be viewed 
as an extension of the Improved Fast Gauss Transform (IFGT) [37]. The method 
is called by the authors e — exact |28j in the sense that the constant hidden in 
0(m -|- n) depends on the desired accuracy which can be chosen arbitrary. 

As for the problem of accelerating computational times for finding the op¬ 
timal bandwidth h relatively less attention is payed in literature. An attempt 
at using Message Passing Interface (MPI) was presented in [35]. In [3T] the au¬ 
thor gives an FFT-based algorithm for accelerating a method (least-square cross 
validation one) for hnding the optimal bandwidth h. 

2.3 Bandwidth selection for kernel density estimates 

Bandwidth selection problem is probably the most important one in the KDE 
area. Fast and correct bandwidth selection is the clue to practical usability of 
kernel-based density estimates of PDFs. Currently available selectors can be 
roughly divided into 3 classes |35l3ll3()| . The first class contains very simple 
and easy to calculate mathematical formulas. They were developed to cover a 
wide rage of situations, but do not guarantee being close to the optimal band¬ 
width. They are however willingly and often used as a starting point in more 
accurate bandwidth selection process. These methods are sometimes called rules- 
of-thumb. The second class contains methods based on least square and cross- 
validation ideas and more precise mathematical arguments. But unfortunately 
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they require much more computational effort. However, in reward for it, we get 
band widths more accurate for a wider range of density functions. The method 
will be abbreviated as LSCV. The third class contains methods based on pluging 
in estimates of some unknown quantities that appear in formulas for the asymp¬ 
totically optimal bandwidth. The methods are called plug-in ones and hereafter 
will be denoted as PLUGIN. These methods are also computationally difficult 
because there is a need for computation of some functionals and the direct algo¬ 
rithm involves O(n^) operations. The computational burden can be reduced by 
using binning strategy as briefly described in section [2^ 


3 Single Instruction Multiple Data architectures 

Single Instruction Multiple Data (SIMD) processing systems process multiple 
streams of data based on a single instruction stream thereby exploiting the data 
level parallelism. First instroduced as a feature of vector supercomputers such 
as GDC Star-100 and Thinking Machines CM-1 and CM-2 SIMD processing 
was later implemented in INTEL’S commodity processors. A similar approach, 
though a little more flexible was implemented in modern GPUs. 


3.1 Parallel data processing on commodity CPUs 

Starting with Pentium MMX, commodity GPUs have started supporting Single 
Instruction Multiple Data processing. This was later extended by both Intel[3] 
and AMD [211] in subsequent extensions called 3DNow!, SSE, SSE2, SSE3, 
SSSE3, SSE4, XOP, FMA4, GVT16 (former SSE5) and AVX. GPUs supporting 
these technologies contain additional registers capable of storing multiple values. 
These are essentially vectors, which may be processed by specialized instructions 
as a whole. For example two 4 value vectors stored in two registers may be added 
by using a single GPU instruction. 

SSEl (Streaming Simd Extensions) introduced 128bit registers capable of 
storing four single precision floating point values as well as a set of 70 GPU 
instructions for processing them. SSE2 added the possibility to store in registers 
two double precision values instead of four single precision and added additional 
144 instructions. SSE3 introduces 13 new instructions, including the ones with 
capability to perform operations on values stored within the same register. SSE4 
added 54 instructions which were (among others) useful for operations performed 
in HD codecs and for string processing. SSE5 instruction set extension was pro¬ 
posed by AMD on 30 August 2007 as a supplement to the 128-bit SSE core 
instructions in the AMD64 architecture. In May 2009, AMD replaced SSE5 with 
three smaller instruction set extensions named as XOP, FMA4, and GVT16, 
which retain the proposed functionality of SSE5, but encode the instructions 
differently for better compatibility with Intel’s proposed AVX instruction set. 
AVX extension increases the length of SIMD registers to 256 bits and adds three 
operand instructions where the destination register is distinct from the source 
registers. 
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SIMD is of course not the only level of parallel data processing on modern 
CPUs. Other solutions used in conujnction with SIMD are: superscalar architec¬ 
ture (multiple execution units and multiple pipelines) as well as multiple cores. 
Our implementations, aside from SIMD also utilize modern CPUs capability to 
run multiple threads in parallel on multiple cores. 

In our implementations we primarily use __ml28 type variables which corre¬ 
spond to 128bit registers holding 4 single precision values. 

3.2 General processing on graphics processing units - GPGPU 

Rapid development of graphics cards driven by the video game market as well as 
many professional applications (such as movie special effects) has led to creating 
devices far more powerful than standard CPUs. Graphics cards are of course 
more specialized than general-purpose CPUs. Typical processing of graphics al¬ 
gorithms involves performing of the same algorithm steps for many different 
input values (such as applying geometrical transformations to vertices or com¬ 
puting pixel colors) and is akin to SIMD concept. However, if any algorithm (not 
necessarily related to computer graphics) may be mapped to such an approach 
to data processing, it may be efficiently executed on a graphics card. In the be¬ 
ginning, programs utilizing GPUs for general purpose processing used a graphics 
API such as OpenGL-|-GLSL or Direct3D-|-HLSL. This caused some restrictions 
on the implemented algorithms (lack of the scatter operation) as well as required 
from the programmer to create some mapping between the algorithm operations 
and graphics operations. These problems vanished when NVIDIA GUDA and 
OpenCL were developed. Both of these APIs allow the programmer to com¬ 
pletely omit the graphics API and use the standard programming language con¬ 
structs to create programs for GPUs. In our paper we use the NVIDIA GUDA 
API, which is closely related to the NVIDIAs graphics cards architecture and 
therefore allows for some low level optimization. 

Let us now roughly describe the NVIDIA GPU architecture and its relation 
to the GUDA API[1]. NVIDIA GPUs are composed of many multiprocessors. 
Each multiprocessor (SM) is composed of several streaming processors (SP). 
Each streaming processor is capable of performing logical and integer operations 
as well as single precision floating point. Groups of SPs on a single SM share a the 
so-called warp scheduler which issues successive instruction or each of the SPs 
in the group. Gonsequently, each SP in a group performs the same instruction at 
the same time (SIMD). Gurrent (2012) graphics cards contain 30 SM with 8 SP 
and 1 warp scheduler each (NVIDIA GeForce 285GTX), or 16 SM with 32 SP 
and 2 warp schedulers each (NVIDIA GeForce 580GTX). This shows, that the 
GPUs are capable of running several hundred threads in parallel (and even more 
concurrently). Each NVIDIAs graphics card has assigned a compute capability 
(denoted GG for brevity) version which specifies which features are supported 
by the given graphics card. Each multiprocessor also contains a small but fast 
on-chip memory called the shared memory. The tasks are not assigned to SMs 
directly. Instead, the programmer first creates a function called a kernel, which 
consists of a sequence of operations which need to be performed concurrently in 
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many threads. To distinguish from kernels used in kernel-based density estimates, 
we will call these functions the gpu-kernels. Next, the threads are divided into 
equally sized blocks. A block is a one, two or three dimensional array of at most 
1024 threads (or 512 threads on graphics cards with CC<1.3), where each thread 
can be uniquely identified by its position in this array. The obtained blocks form 
the so-called computation grid. When the gpu-kernel is started, all of the blocks 
are automatically distributed among the SMs. Each SM may process more than 
one block, though one block may be executed at only one SM. Threads in a single 
block may communicate by using the same portion of the SMs shared memory. 
Threads run in different blocks may only communicate through the very slow 
global memory of the graphics card. Synchronization capabilities of the threads 
are limited. Threads in a block may be synchronized, but global synchronization 
is not possible, though a (costly) workaround exists. Threads in a block are 
executed in 32 thread SIMD groups called warps (this is the consequence of 
having one warp scheduler per several SPs). Consequently all of these threads 
should perform the same intruction. If the threads with a warp perform different 
code branches, all branches are serialized, and threads not performing the branch 
are masked. Perfomance of an implementation is therefore determined by: 

1. the number of global memory accesses (the smaller, the better, use shared 
memory in favor of global memory). 

2. the number of global synchronization points (the smaller, the better, use 
in-block synchronization in favor of global synchronization). 

3. the parallelism degree (the bigger, the better). 

4. each group of 32 consecutive threads should follow the same code branches. 

There are also several other efficiency guidelines, which do not stem from the 
above description, but are related to some lower level details of graphics cards 
hardware: 

1. the smaller, the number of conditional code executions the better. 

2. each thread in a group of 16 (or 32 for graphics cards with CC>2.0) con¬ 
secutive threads should follow a conflict-free shared memory access pattern 
(see g]). 

3. each thread in a group of 16 (or 32 for graphics cards with CC>2.0) con¬ 
secutive threads should use global memory access patterns which allow for 
coalesced accesses (see g]). 

4. use single precision floating point computations, if possible. 

All of our implementations comply with the above guidelines. Regarding the last 
requirement, our implementations use single precision, but in the near future we 
can expect, that double precision will be efficient on GPUs as well. Our solutions 
can be then easily ported to support double precision. It should also be noted, 
that the efficiency of double precision has already increased dramatically between 
two recent generations of graphics cards (compare for example NVidia GeForce 
285GTX and NVidia GeForce 580GTX). 
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4 Mathematical preliminaries 

In this section we give some preliminary informations on some basic statisti¬ 
cal concepts (probability density function and kernel density estimation) as well 
as how to use them in the database area. Most of this section is devoted to 
give the precise recipes for calculation of the so called optimal bandwidth which 
plays the most important role while estimating kernel-based probability den¬ 
sity functions. We also propose some slight but important modifications of the 
reference mathematical formulas for calculating the bandwidth. The modifica¬ 
tions play very important role during GPU-based and SSE-based fast algorithm 
implementations for calculating of the bandwidth. 

4.1 Probability density function 

Let Ai be a random variable and let / be the aforementioned probability density 
function. The probability P that a random variable X takes a value in the given 
interval [a, b] is defined as follows: 


P{a < X < b) = j f{x)dx. (1) 

J a 

Density function must fulfill two basic properties: must be not negative over 
the whole domain, that is f{x) > 0 for every x and the value of the integral 
o must be exactly 1, that is P{—oo < X < -boo) = f{x)dx = 1. In the 
context of the main subject of the paper, we also need to know a formula for 
calculating mean value of the random variable. This is defined as follows: 

/ + 00 

xf{x)dx. (2) 

-CX) 

The above formulas can be easily generalized into two or more dimensions. In 
practice, three different cases can be considered: (a) if analytical formula of the 
PDF is known and the integrals © and dl]) can be solved analytically. In such a 
case the solution we get is the exact one. (b) if we know the analytical formula 
of the PDF, but solving the aforementioned integrals is very difficult or even 
not possible. In such a case one must use methods for numerical integration (a 
broad family of algorithms exist, for example rectangle rule, trapezoidal rule, 
Newton-Cotes formula, Gaussian quadrature, Monte Carlo methods and other), 
(c) if we don’t know at all analytical formula of the PDF, then the nonparametric 
methods for its estimation should be considered. 

4.2 Kernel density estimation 

Let us assume that we have a set of source points to be a sample from an unknown 
density function. Density estimate is simply the process of construction of an 
estimate of the density function from the observed data [3T]. There are two 
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main approaches to density estimation: parametric and nonparametric ones. The 
first assumes that the data are drawn from a known family of distributions. For 
example the normal distribution is fully described by only two parameters (mean 
value p and standard deviation a). All we need to do is to find estimates of these 
parameters from the observed data. In the paper this approach is not considered. 
In the second approach, we assume no preliminary knowledge about a form of 
a density function. It should be discovered from the set of source points. This 
approach is sometimes vividly described as Let the Data Speak for Themselves. 

A trivial example of a nonparametric PDF is histogram. However, its practical 
usability is rather poor (in the context of using them as PDFs estimators), 
because in general, and with its basic form, it is difficult to give an optimal 
number of the bins, their width and the starting point of the first bin. The shape 
of histogram is very sensitive to these three parameters. In that case, it is much 
more convenient to use the so called kernel density estimators (KDE) which are 
much more adequate for building nonparametric PDFs. Three classical books 
on KDEs are [snMis]. There are, however, known families of so called optimal 
histograms (e.g. v-optimal histograms [S]), but we don’t consider them in the 
paper. We only mention that their construction algorithms are very time and 
memory consuming. It should be also noted that basic histograms are commonly 
used in database management systems (DBMSs). This is a feature in cost based 
optimizers and help the optimizer to decide which SQL execution plan should 
be used to get the best query execution. 

Now let us consider a random variable X (in general d-dimensional) and let 
assume its probability density function / is not known. Its estimate, usually 
denoted by /, will be determined on the basis of a random sample of size n, 
that is Xi, X 2 , ..., Xn (our experimental data). In such a case, the d-dimensional 
kernel density estimator f{x, h) of the real density fix) for random sample 
Xi, Ar 2 ,..., Xn is given by the following formula: 

fix, h) = n~^ '^Khix- Xi) = n~^h~'^ ^ ^ ) (3) 

where 

Khiu) = hr^Kihr^u) (4) 

and 

Kiu) = (fl'n)~’^l'^exp (^) 

where n - number of samples, d - task dimensionality, x = ixi,X 2 , ■. ■ ,Xd)’^, 
Xi — iXn , X 2 i,, Xdi) , I — 1,2,...,7z. X\i , X 2 i, ■. ■, Xdi denote the con¬ 
secutive elements of d-dimensional vector X. h is a positive real number called 
smoothing parameter or bandwidth. A"(-) is the kernel function - a symetric but 
not necessarily positive function that integrates to one. In practical applications 
A"(-) is usually the Gaussian normal formula as given in ([S|). Other commonly 
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used kernel functions are Epanechnikov, uniform, triangular, biweight [24) . One 
can prove that selection of a particular kernel function is not critical, as all 
of them guarantee obtaining similar results (Epanechnikov kernel function is 
theoretically the most effective, but others are only slightly worse, see [35] for 
details). However, the bandwidth is the parameter which exhibits a strong in¬ 
fluence on the resulting estimate (shape of the curve). If we have bandwidth h 
we can determine the estimator f(x, h) of the unknown d-dimensional density 
function f{x) using the formula dSj). 

Equation ([3|) assumes that the bandwidth h is the scalar quantity, indepen¬ 
dently of the problem dimensionality. This is the simplest and the least complex 
variant of the bandwidth. Erom the other side, the most general and complex 
version assumes that the so called bandwidth matrix H is used instead of the 
bandwidth scalar h. The size of the H matrix is d x d. This matrix is positive 
definite and symmetric by definition. So, the equivalent of the formula (I3|) is now 
defined as follows: 

n n 

/(x,iJ) (x-X,) (6) 

2=1 2=1 


where 


Kh{u) = (7) 

and K{-) is defined by ([5]). 

Is easy to note that © is not a pure equivalent to 0 as for univariate case 
the 1x1 bandwidth matrix is H = h^. So, now we are dealing with so called 
’squared bandwidths’. 

A version between the simplest and the most complex is also considered 
in literature and is based on simplifying the unconstrained bandwidth matrix 
H to is constrained equivalent where all the off-diagonal entries are zeros by 
definition. In the paper we do not consider this case. All the possible versions of 
the bandwidth selectors are investigated in details in [^ . Below we only sum up 
the three main versions of the bandwidth matrices. 



o' 


h\ 0 


h\ hi2 

0 


, diagH = 

0 hi 

, H = 

hi2 hi 


Einally, one must remember that nonparametric estimators (to be effective) 
need more and more data as dimensionality increases. Here, a negative phe¬ 
nomenon called curse of dimensionality becomes a real problem. As a conse¬ 
quence, values f{x) calculated from ([3|) or ([S]) becomes inaccurate. The problem 
of determining the required sample size needed to achieve a given level of accu¬ 
racy is studied by some authors. See for example [HTm] . 

As an example of how KDE works consider a toy dataset of 8 data points 
x = {0, 1, 1.1, 1.5, 1.9, 2.8, 2.9, 3.5}. Three different PDEs based on these 
data are depicted in figure [1] It is easy to notice how the bandwidth h influences 
the shape of the PDF curve. Choosing the best value of h is not a trivial task and 
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this problem was and still is extensively studied in literature [31133135] . In Figure 
[1] lines in bold show the estimated PDFs, while normal lines show the shapes of 
individual kernel functions K{x) (Gaussian normal kernels). Dots represent the 
data points Xi. 


h = 0.15 


h = 0.4 


h = 0.8 





Fig. 1: An example of using kernel density estimators for determining the prob¬ 
ability density function 


4.3 Probability density functions in databases 

A commonly used logical structure in DW area is the multidimensional cube 
[14] . The particular dimensions can store both nominal (also called categorical) 
as well as numerical data (both discrete and continuous). In the case of nominal 
ones, determining all the possible aggregates is not trivial but feasible task. Of 
course, in practice usually not all possible aggregates are calculated, as the time 
and storage requirements would be prohibitive. In contrast, if a dimension stores 
numerical data one must arbitrary discretize them (by creating separate bins). 
Calculating an optimal number of bins and their width is not a trivial task. 
After discretization, the total number of aggregates which are (or will be in the 
future) possible to create is fixed. To avoid the need for preliminary discretization 
of data, one can consider using PDFs. As a matter of fact, to calculate them we 
must read all the data stored (or eventually, only a representative sample of the 
data). But, what is very important, we need to read them only once. Usually we 
can assume that statistical properties of the data remain constant in a period of 
time (days, months, sometimes even years). So, there is no need to recalculate 
PDFs after every SQL DML statement executed by a user (INSERT, UPDATE, 
DELETE). The obvious benefits of PDFs comparing to materialized aggregates 
is that permanent storing of the former is extremely easy and cheap. Everything 
what we need to store are either parameters of formula-based PDFs (for example 
mean value /i and standard deviation a in the case of Gaussian normal PDF) or 
output points f{x) for nonparametric-based PDFs. 

Now let us consider a relational table where an attribute columnl is numer¬ 
ical in nature. If we want to calculate the number of records [count aggregate 
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operator is SQL) in the range a < columnl < b one can use formulas given 
in section 01 If n is the total number of records, it is trivial to conclude that 
approximate count aggregate can be calculates as 

Count = n- f f{x)dx. (9) 

J a 

Similarly, the sum of all record {sum aggregate operator is SQL) in the given 
range can be calculates as 

Sum = n- xf{x)dx. (10) 

J a 

The similar reasoning leads to conclusion that the product Sum/Count is an 
equivalent to aggregate SQL’s average operator. The aforementioned three for¬ 
mulas can be immediately generalised into multidimensional case, that is for 2D 
case we have 


pd pb 

Count = n- / / f{x,y)dxdy, 

J c J a 


Sum = n ■ 


pd pb 

/ / H^,y)f{x,y)dxdy. 

J c J a 


( 11 ) 


Here h{x, y) is in general a function of two random variables. To calculate for 
example sum of the variable x, one have to set h{x,y) = x. Above we calcu¬ 
late the sum and the number of records within logical range (a < columnl < 
b) AND (c < column2 < d). 

The values of the integrals can be done analytically (if the analytical solution 
exists) or by applying a proper numerical integration method (some of them were 
mentioned above in section mj. In the case of nonparametric approach, only 
numerical integration is practicable. In m we give a few practical examples 
based on real datasets which proof the practical usefulness of using kernel based 
PDFs for calculating database aggregates. 


4.4 Formulas for bandwidth selection 

In the following section we present detailed mathematical formulas for calculat¬ 
ing bandwidths (optimal in some sense) using the PLUGIN and the LSCV meth¬ 
ods mentioned earlier. The PLUGIN method is designed only for ID problems, 
known in literature as univariate problems. Contrary to the PLUGIN method, 
the LSCV method is designed for both ID as well as nD problems (known in 
literature as multivariate problems). Three different LSCV versions can be con¬ 
sidered, while only two were implemented by the authors. The simplest one 
(with the smallest computational complexity) assumes that the bandwidth h is 
the scalar quantity, independently of the problem dimensionality (see equations 
m and m)- From the other side, the most computational complex version as¬ 
sumes that the bandwidth matrix H is used, instead of the scalar h (see equation 
m ). This matrix is positive definite and symmetric. 
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Below, in the next 3 subsections, we give a very condensed recipes for calculat¬ 
ing optimal bandwidth h and H using PLUGIN and LSCV approaches described 
briefly above in section [2?^ The LSCV one is presented in two variants: the sim¬ 
plified one for evaluating optimal h (that is if the density estimate is calculated 
from ([3]); this variant will be called LSCV_h) and the general multivariate variant 
of the method (that is if the density estimate is calculated from ([51) ; this variant 
will be called LSCV_H). We comment the individual mathematical formulas very 
briefly as this is beyond the scope of the paper. Every subsection is prefaced with 
very short overview of the main ideas of the methods. All the necessary details 
on the methods as well as details on deriving of particular formulas can be found 
in many source materials, see for example probably the most often cited books 


|3.^I31I33| . 


All the methods presented below determine the optimal bandwidth on the 
basis of the input random variable and commonly used optimality criterion based 
on minimization of mean integrated squared error (MISE) and its asymptotic 
approximation (AMISE). 


PLUGIN This method is used for calculating an optimal bandwidh h for uni¬ 
variate problems, that is applicable to formula m where d is set to 1. First 
we calculate the variance and the standard variation estimators of the input 
data (equations (IT3]) and (IT31) 1. Then we calculate some more complex formulas 
(equations from (1131) to (fT3]) l. The explanations of the essence of them is beyond 
the scope of the paper and can be found in many books on kernel estimators, 
for example see the three above-mentioned books. Finally, after completing all 
the necessary components we can substitute them into equation (HI to get the 
searched optimal bandwidth h. 

1. Calculate value of variance estimator: 



( 12 ) 


2. Calculate value of standard deviation estimator: 


a = \f^. 


(13) 


3. Calculate estimate of functional 



(14) 
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4. Calculate value of bandwidth of kernel estimator of the function (4th 
derivative of the function /, that is 


f - 2 K^( 0 ) \ 


K^{0) = - 


fJ‘2[ 
15 


(15) 


fi2{K) = l. 

5. Calculate estimate ^ 6 ( 51 ) of functional tf'g: 


n n 




si 


i=l j=l,i<j 


X, - Xj 
9i 


+ nK^^\0), (16) 


K^(x) = (x^ - 15x^ + A5x^ - 15) 

6. Calculate value of bandwidth of kernel estimator of the function 


( -2K‘^(Q) 

~ \h2{K)M9i) 
3 


1/7 


(17) 


^ ^ vS' 

miK) = L 

7. Calculate estimate '^A{g 2 ) of functional tfti: 


n n 


X,; - X, 


92 


^ 92 

A:^(x) = (a;'‘ - 6x2 

8. Calculate the final value of bandwidth h: 

k = ( 

I ^2(if)2>f'4(ff2)n^ 


+ nX(4)(0), (18) 


1/5 


(19) 


= V7- 
m{K) = r 


LSCV_h This method is used for calculating an optimal bandwidh h for both 
univariate and multivariate problems, that is applicable to formula Q where d 
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is set to any integer value qreter or equal to 1. First we calculate the covari¬ 
ance matrix (equation (1211) 1. its determinant and its inversion. Then we form 
an objective function (1241) which will be minimized according to the searched 
bandwidth h. After determining the search range of the bandwidth h (equation 
((^ 1 as well as the starting bandwidth h (equation (1^ 1. using the ’’brute force” 
strategy we search for such h which minimizes the objective function (1241) . The 
performance of the brute force method seems to be acceptable in almost all prac¬ 
tical applications. If, however it will turn out to be too slow in practice, a more 
smart (and faster) method can be used, for example the Golden ratio criterion. 

1. Calculate covariance matrix, given the input data. In columns one can find 
consecutive d-dimensinal vectors of our experimental input data. There are 
n such vectors: 


X = [Xi,X2,---X„] 


x\,l Xi^2 ■ ■ 

* 

X2.1 X2,2 ■ ■ 

■ ^2,n 

^d.l ^d,2 

* ^d,n 


Covariance matrix is equal to: 


erf CTi,2 • • 

■ <Xl,d 

(724 cr| • • 

■ <X2,d 

O'd,! (Xd,2 ■ ■ 

■ -1. 


where: 

— erf - is a variance of each dimension of the random variable, 

— crq,i 2 - is a covariance between random variables ii and * 2 , 


( 20 ) 


( 21 ) 


a? = 








i=i 


( 22 ) 


1 " 


n n 




i=i 


_ 1) E^*ibE^*2.i- (23) 

^ j=i i=i 


2. Calculate determinant of the covariance matrix det{S). 

3. Calculate inverse of the covariance matrix S\ . 

4. Let Xi be the i-th column of the input matrix X. The LSCV_h method 
requires to minimize the function g{h)\ 










tq St 
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g{h) = h 


— d 


n n 


2=1 


X,. - X, 




where 


T{u) = (K * K){u) - 2K{u), 


(24) 


(25) 


K{u) = {2 tt) '^^^det{S) ^^'^exp . 


(26) 


{K * K){u) = {4 :Tt) '^/‘^det{S) . (27) 


5. Calculate the approximate value of the bandwidth h: 


f RjK) 

° {p2{KyR{r)nJ 

R{K) 1 


\ l/(d+4) 


(28) 


P-2{KY 2'^7r'^/2(i2 ’ 

„ _ d[d + 2) 

J > 2'^+27r‘i/2 ■ 

6. Let the range in which we search for minimum of g{h) be heuristically found 
as: 


Z{ho) — [ho/ijAho] . 
The optimal bandwidth h is equal to: 


(29) 


argminheziho)9{h)- 


(30) 


LSCV_H This method is used for calculating the optimal bandwidh matrix 
for both univariate and multivariate problems, that is applicable to formula 
. In this variant of the LSCV method the objective function is defined by 
ration (EH). Now our goal is to find such the bandwich matrix H which will 
minimize this objective function. This is the classical nonlinear optimization 
problem and can be solved by using for example the well known Nelder-Mead 
method m- This method needs a starting matrix which can be calculated from 
the rule-of-thumb heuristic equation (EZD taken from [35) . 
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Two detail notes on evaluating the objective function (15^ are needed here. 


First, as we said above, Nelder-Mead method can be used for finding the optimal 
H bandwidth matrix. This method does not guarantee that in every step the 
current H is positive-definite. This requirement, however is necessary as the 
bandwidth H matrix components must be by definition positive scalars. So, the 
searching for the optimal H must be done over the space of all positive-definite 
matrices only. Moreover, while calculating the current value of ([H^ the inversion 
of H is needed, hence positive-definite requirement is a must. 

Second, we know that bandwidth matrix H is always a symmetric one (see 
®) and its size is d x d. So, only d{d + l)/2 independent entries exist. As a 
consequence there is no need to evaluate the objective function for the full H 
matrix. It is sufficient to use vech{H), where vech (vector half) operator takes a 
symmetric d x d matrix and stacks the lower triangular half into a single vector 
of length d{d + l)/2. That is for an example matrix we have 


1 4 7 

A= 2 5 8 , vech{A) =[1 2 3 5 6 9]. 
369 


(31) 


1. Let 


n 


n 



i=i j=i^i<j 


where 


Th{X, - X,) = {K * K)h{X, - Xj) - 2 Kh{X, - X,), (33) 


Kh{X,-Xj) 


(34) 



{K * K)h{X, - Xj) 


(35) 



R(K) = 2-"*7r-"*/2|^|-i/2_ 

2. Find H which minimizes g{H). Start from (17 is defined by (1211) '): 


(36) 


Hstart = (4/(d+ 


(37) 




18 


Density Estimations for AQP on SIMD Architectures 


4.5 Some formula modifications 

The equations for LSCV_h algorithm may be reformulated to require less oper¬ 
ations. Let us consider equation (1^ : 

K{u) = . 

As can be determined from equation (IMll . the u is always equal to . 

Let us therefore reformulate the equation (1^ : 


K{v) = K(v/h) = 

Let 

S{v) = v^E-^v. (39) 

If we substitute this into equation (1381) we obtain: 

k{v) = {2'K)-’^/‘^\E\-^/‘^exp ■ (40) 

Analogous changes can be made to equation (071) : 

{K + k){v) = (47r)“'^/^|A'|“^/^exp ■ (44) 

These changes can be next propagated to equations (071) and (051) : 

f{v)=T{^ ={k*k){v)-2k{v) (42) 


g{h) = h-<^ 


2n"^^ H T{Xi- Xj)+n-^R{K) 

i=i 


(43) 


It is easy to notice that S(v) values are scalars, and moreover they are constant, 
independent on the value of parameter h of the function g{h). Consequently, 
they may be precalculated for each combination of two vectors X at the start 
of the algorithm and used multiple times during the search for minimum of 
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g{h). Let us determine the complexity of calculating the g{h) function before 
and after modifications. Calculating of a single exponent value in either K{u) or 
{K *K){u) function requires 0{(P) operations (see section [5A]) . These functions 
need to be calculated 0{ii?) times, which leads to the complexity of 0{n^(f). Let 
Uh be the number of times g{h) function needs to be calculated during searching 
for its minimum. Total complexity of unmodified LSCV_h algorithm is therefore 
0{nhn^(P) 

Precalculating of a single S{v) value requires 0{(P) operations. n{n — l)/2 
of S{v) values need to be precomputed. Consequently precomputing of all S{v) 
values has a complexity of 0{n^(P). However, since S{v) values may be reused, 
computing of the g{h) value has only the complexity of 0{'n?). Consequently, 
total complexity of the modified LSCV_h algorithm is 0{n?{(P + Uh))- 

The solutions described above unfortunately cannot be used for optimizing 
of the LSCV_H algorithm. This is due to the fact that the expression {Xi — 
Xj)'^H~^{Xi — Xj) found in equations (IM)) and (which is an equivalent to 
S{v)) depends on the g{H) (equation (1321) 1 function argument. Consequently, 
this expression must be recomputed each time the g{H) function is computed. 


5 Optimization of bandwith selection on SIMD 
architectures 

In this section we describe parallel algorithms for bandwidth selection and imple¬ 
mentation details for two out of three implementations compared in this paper: 
SSE implementation and GPU implementation. Sequential implementation will 
be described in section [5] 


5.1 Identification of some compute-intensive parts in mathematical 
formulas 

Let us take a closer look at equations (IT^ . (fTHll . (ITi^ . and ( 021 ) • All of these 
equations (among other operations) compute sums of large number of values. As 
such sums are performed multiple times and constitute a large part of the number 
of basic matematical operations computed in these equations. Accelerating them 
would significantly increase algorithm performance. Consequently, in general we 
need an algorithm which given a single row matrix A, would compute: 

n 

E(A)=^A,. (44) 

i=l 

The process of using the same operation multiple times on an array of values 
to obtain a single value (sum, multiplication, mininimum, maximum, variance, 
count, average etc.) is called reduction of an array. Parallel reduction of large 
arrays is a known problem [MTK| . Basic algorithm as well as GPU and SSE 
implementations, are described in section 15.21 
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Let us consider the equation (Ha- This equation contains two sums. One 
of these sums is a sum of values of a scalar function computed based on values 
stored in a matrix. Formally, given a single row matrix A and a function fun{x), 
this sum is equivalent to: 


Rfun{A) = ^ fun{Ai). (45) 

i=l 

Parallel computation of such sums can be performed by using a simple modifi¬ 
cation of parallel reduction algorithms. For details refer to section [5751 

Let us now consider equations (USD and (na. Given a single row matrix A 
of size n and a function fun{x), both of these equations contain double sums of 
function values equivalent to: 


RRfun{A) = E E fun{Ai-Aj). (46) 

i=l 

As can be easily noticed, the function fun{x) is computed for a difference be¬ 
tween every combination of values from row matrix A. Parallel algorithms for 
computing such sums are given in section [5^ 

Similar sums can also be found in equations (13211 and (14311 . These sums, given 
a two dimensional A matrix, and function fun(x) are equivalent to: 

n n 

RR}^^{A) = J2 E (47) 

i=l 

In these equations however, each argument of the function fun{x) is a vector 
and computation of this function is much more complex. Moreover, in both cases 
function fun{x) can be expressed as: fun{x) = f un\[f un2{x)) where 

fun2{x) = x^Mx, (48) 

M is any matrix and funl(y) is any scalar function. Let us now consider equation 
(gSl)- Here, the function fun{x) is an equivalent of the function T{v) presented 
in equation dmi. Function T{v) is computed using functions K{v) (equation 
TO) and [K * K){v) (equation (IdlTl L These functions in turn can be computed 
based on a value of the function S{v) (equation (15511 1 which is an equivalent 
of fun2{x). As was mentioned earlier in section IT751 the S{v) values can be 
precomputed and used each time equation (I43II is computed. We can therefore 
split the problem of computing the sums in equation (H51l into two problems: (a) 
computing fun2{x) {S{v)) values (see section [531l and (b) finding a sum of values 
of a scalar function introduced earlier. Section l6^ presents details on how to use 
precomputed S{v) values in LSCV_h algorithm. Similar observations can be also 
made for equation (I32II . Here, the function fun{x) is an equivalent of the function 


^ The V superscript stands for vector. 
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Tni^Xi—Xj) presented in equation (1331) . Function TniXi—Xj) is computed using 
functions KniXi — Xj) (equation (IMl) i and {K * K)H{Xi — Xj) (equation (IHHl) '). 
Exponents of both functions KniXi — Xj) and {K * K)H{Xi — Xj) contain 
{Xi — Xj^H~^[Xi — Xj) which is an equivalent of fun2(x). Unfortunately 
here values of fun2 cannot be precomputed as matrix H~^ is different every 
time equation (15^ is computed. Nonetheless solutions presented in section [53] 
can also be used in this case. Section 16.31 presents how to efficiently compute 
equation (1!?^ . 

5.2 Parallel reduction of array values R{A) 

Basic algorithm In this section we briefly introduce basic ideas behind the 
well known problem of parallel reduction of array values m, i e. given a single 
row matrix A, we present an algorithm which computes 

n 

R{A) = Y,A.. 

i=l 

Basic idea behind the parallel reduction of arrays is presented in Figure^] At the 
beginning the array contains 8 different values. As a first step, pairs of neighbor 
values from the input array should be added in parallel. Results are stored in 
the first half of the array. This process is repeated but each time it runs only 
on the first half of the array which was the input to the previous step. The 
algorithm stops when the input part of an array is composed on only a single 
value, which is the result of the parallel reduction. This approach may be easily 
generalized to larger arrays of values, even of non power-of-two sizes. In such 
a case, values whose “pairs” land outside of array bounds, are left unchanged. 
The above generic method for value reduction will be adapted in the following 
sections to accelerate several specihc parts of the equations. 

GPU implementation GPU based algorithm for reduction of a large number 
of values was proposed in |15j where the author introduces several low-level 
optimizations which include: 

— utilizing a small but very fast shared memory available on GPUs (see section 

[321), 

— using alternative reduction scheme (pairs of added values are not neighbours 
in the array), 

— unrolling of loops and utilizing templates to achieve very efhcient gpu-kernel 
for reduction of arrays. 

Paper m presents a gpu-kernel in GUDA G code, which performs reduction 
of up to 1024 values (using 512 threads) stored in the shared memory array. 
Graphics cards with CC>2.0 may start up to 1024 threads per block, so this 
code may also be appropriately extended to support arrays of size up to 2048 
values. The same paper also introduces a schema which utilizes this gpu-kernel 
to support arbitrary sized arrays: 
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Step 0 


Step 1 


Step 2 


Step 3 



Fig. 2: Parallel reduction schema 


1. Divide the input dataset into subsets of consecutive values of size equal to 
some power of 2 (1024 or 2048 depending on compute capability). 

2. Start a one dimensional block of threads for each of these subsets (in a one 
dimensional grid). Each block should be composed of the number of threads 
equal to the half of the size of the subset. 

3. Each thread in each block should retrieve two values from the input array 
(which is initially stored in the slow global memory), add them and store 
the result in the fast, shared memory. If for some reason grid of threads is 
not big enough to process all data, each thread should add more then two 
values. 

4. Next, the threads should perform the parallel reduction algorithm on the 
values stored in the shared memory. 

5. The resulting value (for each block) should be stored in the output array in 
the global memory. 

6. The above steps should be repeated on the output array of the previous 
iteration until only a single value is obtained. 


SSE implementation Our SSE implementation uses a horizontal ad¬ 
dition instrinsic instruction __ml28 _mmJiadd_ps(__ml28 x,__ml28 y) where 
X and y are two 128 bit vectors of four floats. Given two vectors 
a= [al, a2, a3, a4] and b= [a5, a6, a7, a8] the instruction computes one vector 
containing [al+a2,a3+a4,a5+a6,a7+a8], i.e. it adds neighbour values. As may 
be easily noticed, the __ml28 _mmJiadd_ps(__ml28 x,_ml28 y) instruction per¬ 
forms first iteration of reduction schema presented in Figure [2] For bigger ar¬ 
rays, each iteration, except the last two (reducing four values into one), may 
be performed by executing the instruction on consecutive elements of the array 
appropriate number of times. 
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The algorithm described above requires many memory accesses and may 
therefore suffer delays stemming from memory access latency. The problem 
would be largely reduced if the whole temporary array would fit within the 
CPU’s cache. To increase probability of such situation, the processed array is 
split into many chunks of arbitrary small (but power-of-two) sizes, such that 
would fit in the cache. Each such chunk is processed independently and the 
results are then reduced using the same algorithm (similarly as in the GPU 
approach presented above). 

Notice that the described algorithm has a very low parallelism when com¬ 
pared with the GPU approach (only 4 values are added at the same time). To 
work around this, we start a number of independent threads, each of which pro¬ 
cesses a subset of chunks of the processed array. The number of threads depen¬ 
dents on the number of CPU cores and core capability (such as HyperThreading 

m)- 

Careful reader might notice that SSE implementation could be implemented 
in a much simpler way by using a simple sequential reduction and standard vec¬ 
tor addition intrinsic instruction __ml28 _mm_add_ps(__ml28 x,__ml28 y) where 
X and y are two 128 bit vectors of four floats. Moreover, by using this method 
memory access latency could be largely reduced. However, as with all of float¬ 
ing point mathematical computations, reduction algorithms are a subject to 
numerical rounding errors. Sequential reduction has an error constant of 0(n), 
whereas the pairwise reduction algorithm presented above has an error constant 
0 (log 2 n) [T7] and consequently yields smaller errors. One could also argue, that 
there exists another sequential reduction algorithm introduced by Kahan in na, 
which has an error constant of 0(1) |17j . This algorithm however requires sev¬ 
eral times more floating operations and is therefore much slower then simple 
sequential reduction and pairwise reduction. 

5.3 Parallel reduction of scalar function values Rfun(A) 

Basic algorithm In this section we present an algorithm for computing sums 
equivalent to: 

n 

Rf un fun{Ai), 

i=l 

where A is any single row matrix and fun is any scalar function. Notice, that the 
function fun for each iteration is computed independently. Therefore, values of 
this function may be easily computed in parallel and the obtained values can be 
reduced (summed up) later using the previously introduced reduction algorithms 
(MapReduce approach [S]). However, to remove the need for storing function 
values in memory we suggest to slightly modify the basic array reduction schema 
described in section 15.21 in such a way that fun function values are computed 
and added on-the-fly. 

GPU implementation GPU implementation is a straightforward modification 
of the scheme presented in section 15.21 In step O after the values are retrieved 
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from the global memory, fun function values are computed. Afterwards, the com¬ 
puted values are added and stored in the shared memory. This is implemented as 
a separate gpu-kernel, as subsequent reduction (see step|6]) is performed without 
computing of fun function values. 

SSE implementation Presence of fun function in SSE implementation of re¬ 
duction algorithm requires minor adjustments in the basic reduction function. 
First, a SIMD version of the fun function needs to be implemented. This new 
version should process four values passed in a single __ml28 vector, and return 
four values also as a single __ml28 vector. The only modification to the SSE 
reduction implementation requires that in the first iteration of the algorithm, 
the SIMD version of the fun function is used on every four value vector re¬ 
trieved from memory and obtained result is subsequently used in the reduction 
algorithm. 

5.4 Parallel reduction of scalar function in nested sums RRfuniA) 

Basic algorithm In this section we present an algorithm for computing sums 
equivalent to: 

n n 

RRfunjA) = fun{Ai-Aj), 

i=l 

where A is any single row matrix and fun is any scalar function. Based on the 
solutions presented in [1], we propose the following parallel schema for comput¬ 
ing function values, which utilizes cache memory. First, let us notice that fun 
function values are computed only for indexes i and j such that i < j and might 
therefore be stored in an upper triangular matrix of size n x n. Lets take a look 
at Figure [31 The A matrix is divided into chunks of small (power-of-two) size 
k (recall that here A matrix contains only a single row). The triangular matrix 
of fun function values is divided into tiles of size k x k. Notice, that tiles on 
the diagonal contain redundant positions which should not be computed. Each 
tile corresponds to some combination of two A matrix chunks denoted E and 
F. For each tile, a group oi k x k threads is started. First, a subset of threads 
in a group copies the corresponding chunks into the cache memory. Next, each 
thread in the tile computes the function value based on two A matrix values 
retrieved from the cache. Redundant threads (below the main diagonal) return 
function value 0. Next, threads in each tile cooperate to reduce the computed 
function values into a single value using parallel reduction algorithm introduced 
earlier. Consequently, each tile yields one reduced value. These values are then 
formed into a single result, by once again using a parallel reduction algorithm. 

The remaining problem is: what to do if the size of the A matrix cannot 
be expressed as a multiple of the chosen k value? In such a case, the matrix 
should be extended to the size equal to the nearest multiple of k. Redundant 
threads allocated to process tiles corresponding to last chunk of the array storing 
A matrix should output zero as a function value. No additional changes to the 
above schema are required. 
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Fig. 3: Parallel cache aware computation of values of two variable functions 


GPU implementation The proposed parallel schema seems to fit the GPU 
architecture and CUBA API pretty well. Each tile can be processed by a square 
block of threads and the shared memory can be used as a cache. Unfortunately, 
the computation grids in CUBA API can only be linear or rectangular (no 
triangular grids). To solve this problem we propose to run a one dimensional 
grid of blocks, and based on the block number, compute its position in the 
triangular matrix (see Figure 0]). To find the position of the tile in the triangle 
matrix based on its number on the one dimensional grid, we propose to use 
the equations (l4^ and (l50l) (see appendix 1X1 for derivation of these equations), 
where bx is the block number, I is the corresponding tile column and q is the 
corresponding tile row: 


^8bx + 9 — 3 


(49) 


q = bx — 


l{l + l) 
2 


(50) 


Notice, that as the number of blocks in one dimensional grid cannot exceed 
65535 (current GPU limitation), the number of tiles in one dimension of the 
triangle matrix cannot exceed 360. This in turn limits the number of values 
in the A matrix to only 360fc columns. To solve this problem we allocate two 






















































































26 


Density Estimations for AQP on SIMD Architectures 


dimensional grids which are composed of at least the appropriate number of 
blocks and then we find the “one dimensional” number of the block based on its 
position in the two dimensional grid. This new position is used in the subsequent 
computations in equations (l4^ and (l50l) . The number of blocks started this way 
may be greater than required. Therefore each thread must detect whether it 
belongs to one of such redundant blocks and abort computations if necessary. 

Given NVIDIA graphics card limitations, A: = 16 (256 thread blocks) could 
be used for graphics cards with CC<1.3 and k = 32 (1024 thread blocks) for 
graphics cards with CC>2.0. However, as was observed in m, given an array of 
n values, only n/2 threads are used during reduction. Therefore, after computing 
of function values in a tile, only half of the threads in each block would be used 
in subsequent tile values reduction. To solve this problem we propose to use 
k = 32 regardless of graphics cards compute capability, but use blocks of only 
256 threads to process tiles. Each thread should compute four fun function 
values and add them. This way we allow our code to be run on all graphics cards 
and achieve better thread utilization at the same time. 



Fig. 4: Conversion of block number to its position in upper triangular matrix 


SSE implementation In our SSE implementation each tile is processed by a 
single thread, though some processing parallelism is achieved due to the usage 
of SSE SIMD instructions (similarly as in section [531 SIMD version of the func¬ 
tion fun is needed). Here, cache is implemented as two local arrays E and F 
(corresponding to chunks E and F) which are read multiple times during com¬ 
puting of tile values and therefore are probably cached. A matrix is divided into 
16 value chunks (fc = 16) and consequently each tile is composed of 256 values. 
However, each such chunk is divided into 4 four value vectors stored in SSE 
__ml28 variables. Processing of each tile is illustrated in Figure [5] and described 
below: 
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Fig. 5: Processing of a tile in SSE implementation 


1. First, the two chunks corresponding to a tile are copied to local arrays E 
and F. 

2. Next, the tile is processed in four stripes. Each stripe is processed as follows: 

(a) A four value vector is retrieved from the E local array which contains 
“row values”. 

(b) Next, these four values are copied into four __ml28 variables (subse¬ 
quently called row variables). 

(c) Each row of a stripe is processed in four steps, each of which includes 
the following operations: (1) retrieve a four value vector from F ar¬ 
ray, (2) compute a difference between appropriate row variable and the 
retrieved vector using __ml28 _mm_sub_ps (__ml28 x, __ml28 y) instruc¬ 
tion, (3) compute four fun function values using SIMD version of the 
fun function and (4) add the obtained results into a single accumula¬ 
tor __ml28 type variable using __ml28 _mm_add_ps(_ml28 x, __ml28 y) 
instruction. 

3. After all stripes are processed, the accumulator contains four values. These 
values are added to obtain final result of tile processing. 

Processing of a tile that lies on main diagonal is similar, but several modifi¬ 
cations are made. First, only one chunk is retrieved into local array (obviously). 
Moreover, loops are altered to not process parts of stripes below the main di¬ 
agonal. Notice however, that some “below diagonal” values in small 4x4 tiles 
are still computed. In such cases, each vector computed by the reduced function 
is multiplied by appropriate vectors composed of zeroes and ones to reset the 
unwanted function results. 
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The tiles are processed concurrently on a number of threads which are pro¬ 
cessed most efficiently on a CPU (depending on the number of cores and whether 
the HyperThreading [26] capability is available or not). 


5.5 Computing of fun2 function values in RR'^^^(A) 

Basic algorithm Let us recall that in equations (15^ and (H51) the sums equiv¬ 
alent to: 


RR}uniA)=Y^ fun{A,^i-A,^j) =Y Y funl{fun 2 {A,^i-A,j)) 

i=l i=l j = 

can be found, where A is any matrix, funl is any scalar function and fun2{x) = 
Mx where M is also any matrix. As was suggested in section [??T] parallel pro¬ 
cessing of such sums can be split into two problems: computing fun2 function 
values and then reducing them using previously introduced algorithm. Conse¬ 
quently, we need an algorithm which given an A matrix would find a triangular 
matrix B = [bij], i = 1, ... ,n, j = i + 1, ... ^n, such that: 

bi,j = fun2{A:^i - A.,j). 

As each fun2 function value may be computed independently, parallel compu¬ 
tation algorithm seems obvious. To achieve better performance a cache aware 
solution similar to the one presented in section 15.41 may be used. Consider the 
schema presented in Figure [HI A matrix is divided into chunks of k vertical vec¬ 
tors (columns). The triangular result matrix is divided into tiles of size k x k 
(notice, that tiles on the matrix diagonal contain excessive positions). Each tile 
corresponds to some combination of two A matrix chunks. Row of a tile within 
the triangular matrix is denoted as q and column is denoted as 1. For each tile, a 
group oi k X k threads is started. First, a subset of threads in a block copies the 
corresponding chunks into the cache memory. Next, each thread in the tile com¬ 
putes the function value based on two vectors retrieved from the cached chunks. 
Each thread detects whether it is over or on the main diagonal or not. If it is 
below the diagonal it stops further computations. If it is over the main diago¬ 
nal it computes fun2 function value and stores it in the output array. Linear 
position in the output array may be computed using equation for sum of arith¬ 
metic progression, based on the threads coordinates within the triangular array. 
Notice that the order of stored values is unimportant, as they are subsequently 
only arguments for functions whose values are later reduced (summed up). 

We shall now derive an efficient order of performing operations needed to 
compute fun2 function values within a tile for an array storing A matrix that is 
row-major aligned in the computers memory. For simplicity let us assume that 
the considered tile does not lie on the main diagonal. We shall denote the tile of 
the matrix containing fun2 function values as a fc x fc submatrix Y. Each such 
tile corresponds to two d x k {d is the number of rows in A matrix) chunks E 
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Fig. 6: Parallel computation of fun2 function values 


and F of the A matrix. Let us assume, that chunks E and F start at columns 
qk + I and + 1 respectively. Let: 


i = qk F r, 
j = Ik + p 


(51) 


where r,p = 1,..., fc. Consequently = A.^gk+r = E-^r and A-j = A-^k+p = 
F-^p. Let = E-^r — F.^p = A-^i — A.j be all of the arguments of the fun2 func¬ 
tion within a tile. From the definition of function fun2, and the new notations 
introduced above we know that: 


y^p = fun2{v^’P) = {v^’PfMv^’P. (52) 

Let us extract the first matrix multiplication operation from the equation (1521) : 

The value is a horizontal vector of d scalars: 


z 


r,p 




.,d 


(54) 
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where each value is a result of a dot product between vector and a-th 
colnmn of the M matrix: 




TOc 


c=l 


Let us now substitute the equation (iMl) into equation (l5^ : 

yr,p = = [z:’na=U„„,V^’^. 


(55) 


(56) 


As v^’P is a vertical vector of d values, the above expression is a dot prodnct of 
vectors z'"’P and v^’P: 


yr,p = [z:’\=,A<’\ 


a—1 


If we substitute equation (1551) into equation (1571) we obtain: 


(57) 


d / d 

yr,p = XI ( XI 

a—1 \c—1 


V 


r,p 

a * 


(58) 


Recall that v^’P = ex,r — fx,p- If we substitute this into equation (15^ we obtain: 


d / d 

yr,p = ( X!(^c,r 

a—1 \c—1 



/a.p)- 


(59) 


As each value is computed independently of other y^.p values, we can extend 
the above equation to compute the whole row Rr,: of Y matrix. This is accom¬ 
plished by replacing each occurence of the column number p with the colon which 
means “all available values”. To retain correctness of terms that are not depen¬ 
dent on p (such as Cc.r) we introduce the following notation. By [x\k we denote 
a horizontal vector of k values equal to x. All terms that are not dependent on 
p are converted into horizontal vectors of k values. Consequently, the row r of 
the tile matrix Y may be computed as follows: 


d / d 

Yr^: = ^ ( I ^ ^ ([^c,r]fc Rc,:)^c,a 

a—1 \c—1 

Notice, that equation (1601) expresses a single tile row in terms of either single 
matrix values (ex,r and dc^a) or chunk rows (Tic,:)- Let us rewrite the above 
equation in algorithmic form: 


^ i[ea,r]k - Fa,:) (60) 


For each r = 1,..., fc perform the following steps: 

1- Yr,: ■<— [0]fc 

2. for each a = 1,... ,d perform the following steps: 
(a) part ^ [0]^ 
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(b) for each c = 1,..., d perform the following step: 

part part + me,a * ([Cc.rjfc - Fc^-.) 

(c) Fe,: f- Yr,, + part * ([eo.r]fc - Fa,:) 

3. Output Yr,: 

As we assume row-major order storage of the A matrix, rows F^,: are stored 
in linear portions of the memory, which allows for efficient accesses. Notice that 
each access to a chunk row is accompanied by an access to a single value in both 
M matrix and second chunk (E). These accesses to memory unfortunately are 
not to consecutive memory addresses. Notice however that there are only two 
such accesses per one large linear access to chunk row. Moreover, as M and E 
matrices are small, they can be easily fit within the cache memory for faster 
accesses. 

GPU implementation Our GPU implementation is a straightforward imple¬ 
mentation of the parallel schema presented in section [57^ Each tile is processed 
by a group of 256 threads {k = 16). We have also limited lengths of vectors to up 
to 16 values (d < 16) to simplify copying of data to shared memory (acting as 
cache) and subsequent processing (256 threads can copy 16 vectors of 16 values 
in parallel). As CUBA threads are not processed independently (they are run in 
32 thread SIMD groups) and are a subject to memory access restrictions, loops 
computing fun2 function values, as well as global memory to shared memory 
copying code, have to be carefully designed and data has to be properly layed 
out in memory in order to avoid serialization of memory accesses. 

Let us start with copying of data. A matrix is stored in an array in GPUs 
global memory in row major order. Consequently, subsequent values in the 
memory belong to subsequent matrix columns. Recall that CUBA threads are 
grouped in warps. Consider a single memory access instruction. Depending on 
the compute capability of the graphics card one memory transaction is performed 
per halfwarp (compute capability < 1.3) or one per warp (compute capability 
> 2.0), as long as each accessed address lies within the same 128B segment within 
the global memory. On graphics cards with compute capability (< 1.1) additional 
restrictions on accessed addresses are imposed. Given the fact that k = 16, 16 
columns should be copied to the shared memory. To adhere to the memory ac¬ 
cess requirements, threads access consecutive memory addresses (i.e. they copy 
“rows” of the processed chunk of matrix). Gonsequently, a single thread warp 
will copy two rows of data. For GPUs with GG<1.3, a memory transaction per 
halfwarp is perfomed and consequently, each row will be retrieved in a single 
64B memory transaction. For GPUs with CC>2.0, even though one 128B trans¬ 
action can be performed per warp, here two 128B transactions will be made, 
as each row is in different memory segment. Gonsequently for graphics cards 
with CC>2.0, memory retrieval is not as efficient as for the graphics cards with 
CC<1.3 (unless we increase k to 32). However, the excessive retrieved rows will 
be cached and might benefit other thread blocks. Moreover, this step is short, 
when compared to next computation steps and the delay should be negligible. 
Copying of data from global memory to shared memory is depicted in Figure [71 
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Fig. 7: Copying data to shared memory 


To compute values M matrix is needed as well. Array storing M matrix 
is stored in row major order in constant memory. Let us assume that the shared 
memory array storing columns from the chunk corresponding to the “row” coor¬ 
dinate of the triangle matrix tile is denoted E whereas the shared memory array 
storing columns from the chunk corresponding to the “column” coordinate of 
the triangle matrix tile is denoted F. By tx and ty we denote thread coordinates 
within a block. The main loop of a single thread computing fun2 function value 
is as follows: 

1. p ^ tx 

2. r ty 

3. yrp ^ 0 

4. For each consecutive row a 

(a) part ■«— 0 

(b) For each consecutive row c 

part •«— part + M[c * d + a] * {E[c * k + r] — F[c * k + p]) 

(c) yrp yrp+ = part + {E[a * k + r] — F[a * k + p\) 

It is easy to notice that this algorithm is a straightforward implementation 
of the equation (1591) , where the r and p coordinates of a computed value within 
a tile correspond to threads location within a block. Notice, that each thread in 
a warp will access the same value in the M array at each iteration of the loop. 
This is an optimal access pattern for the constant memory where this matrix is 
stored. 

Let us now analyse the access patterns to the shared memory arrays E and F. 
Data in these arrays is stored in row major order. This is due to the fact that A 
matrix is stored in row major order in the global memory and code that retrieves 
data from A matrix into arrays E and F retains this order. Consequently, first k 
values in both of these arrays are the values of the first row of the corresponding 
chunk of the A matrix. Next k values correspond to the second row, and so on. 
Consequently, consecutive values of each column are stored every k values. Such 
storage allows our gpu-kernel to access data in the arrays E and F optimally. 
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We shall start with an explanation for graphics cards with compute capabil¬ 
ities < 1.3. Let us assume that k = 16. Consequently, each half-warp processes 
one row of a block. As conflicts may only appear between threads in a single half¬ 
warp, our subsequent analysis will consider only one row of a block. In Table[T]we 
present for each thread in a halfwarp its corresponding p and r [tx and ty), array 
address which will be accessed in both E and F arrays as well as corresponding 
shared memory banks. Each consecutive thread is assigned consecutive tx values 
[1]. Given the fact, that there are 16 shared memory banks, each thread in the 
half-warp accesses array F through a different bank, which in turn guarantees 
that there are no conflicts. Let us now consider the E array. It would seem, 
that in each iteration every thread accesses the same bank. However, it not only 
accesses the same bank, but also the same address. Consequently, the broadcast 
mechanism of shared memory may be used and all such reads are performed in 
a single memory transaction within a half warp. 

For graphics cards with compute capability > 2.0 explanations are similar. 
The graphics cards with >2.0 have 32 shared memory banks, and each memory 
transaction is performed per warp not per half warp. Given this, all of the 
above explanations are correct if fc = 32. It is interesting however to notice 
that even if fc = 16, our gpu-kernel is also efficient. Please take a look at Table 
[21 Similarly as in Table |T] we present for each thread in a warp its corresponding 
p and r (tx and ty), array address which will be accessed in both E and F 
arrays as well as corresponding shared memory banks. If fc = 16 than each warp 
processes two rows of a block. This causes 16 2-way conflicts between threads 
accessing the same values in the F array (16 pairs of threads access the same 
bank), and two 16way conflicts between threads accessing the E array (two 
groups of 16 threads access the same bank). However, it may also be noticed 
that in all of these conflicts the same values are accessed and may therefore 
be broadcasted. The graphics cards with GG> 2.0, have an improved broadcast 
mechanism in which all broadcasts are processed in a single shared memory 
transaction. Consequently, the described conflicts will not cause memory access 
serialization. 


Table 1: Breakdown of memory accesses within a single halfwarp on GPUs with 
compute capability < 1.3 (fc = 16) 


thread in halfwarp 

0 

1 

2 

14 

15 

tx = p 

0 

1 

2 

14 

15 

ty = r 

g 

g 

9 

g 

g 

F address c* k + p 

16c -b 0 16c -k 1 16c -t 2 ... 

16c-b 14 16c-b 15 

F bank 

0 

1 

2 

14 

15 

E address c* k + g 

16c -b g 16c -b g 16c + g ... 

16c -b g 

16c -b g 

E bank 

9 

g 

g 

9 

9 







34 


Density Estimations for AQP on SIMD Architectures 


Table 2: Breakdown of memory accesses within a single halfwarp on GPUs with 
compute capability > 2.0 (fc = 16) 


thread in warp 

0 

1 

2 

14 

15 

tx = p 

0 

1 

2 

14 

15 

ty = r 

g 

5 

5 

5 

5 

F address c* k + p 

16c -to 

16c -t 1 

16c -k 2 ... 

16c -k 14 

16c-k 15 

F bank 

0 

1 

2 

14 

15 

E address c* k + r 

16c -1- g 

16c -1- g 

16c -k 5 

16c -k 3 

16c -k 3 

E bank 

g 

g 

5 

5 

5 

thread in warp 

16 

17 

18 

30 

31 

tx = p 

0 

1 

2 

14 

15 

ty ^r 

5 + 1 

5 + 1 

5 + 1 

5 + 1 

5 + 1 

F address c* k + p 

16c -to 

16c -k 1 

16c -k 2 ... 

16c -k 14 

16c-k 15 

F bank 

0 

1 

2 

14 

15 

E address c* k + r 

16c -1- 3 -|- 1 

16c -k 3 -k 1 

16c -k 3 -k 1 ... 

16c -k 5 + 1 

16c -k 5 + 1 

E bank 

5 + 1 

5+1 

5 + 1 ■■■ 

5 + 1 

5 + 1 


SSE implementation SSE implementation is a straightforward implementa¬ 
tion of the generic algorithm presented in section 15.51 Let us start with descrip¬ 
tion of processing of the off-main diagonal tiles. Here we also use k = 16. First, 
data is copied into local arrays (which represent E and F matrices) which will 
hopefully be cached. Subsequent code is just an implementation of the solution 
from section [5.51 One important thing that should be noted is that we use four 
variables of type __ml28 to implement vector Yj.,: and another four variables to 
implement vector part. Main diagonal tiles are processed similarly by a spe¬ 
cialized variant of the algorithm. The main differences are that code part that 
outputs results, ignores the excessive values and processing of parts of Ur,: and 
part vectors is omitted, when applicable. 

Utilizing multiple cores, as previously involves running the two variants of 
the tile processing algorithm in multiple threads. 

6 Algorithm implementations 

In this section we describe how to utilize the algorithms presented in section [5] 
to create efficient SSE and GPU implementations of PLUGIN, LSGV_h and 
LSGV_H algorithms. 

6.1 PLUGIN 

In our PLUGIN algorithm implementation we utilize parallel reduction algo¬ 
rithms presented in sections 15.21 and 15.31 to compute variance estimator (step 1, 
section 16.11 equation (1121) 1 and algorithm presented in section [5^ for computing 
sums in steps 5 and 7 (section IHUl equations (ITCl) and (fTSll l. Remaining steps 
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(2,3,4,6 and 8) are all simple equations that are inherently sequential and there¬ 
fore cannot be further optimized. Nonetheless they require very small number 
of operations and can therefore be performed on CPU in negligible time. 

6.2 LSCV_h 

Our implementations of the LSCV_h algorithm use the modihed equations pre¬ 
sented in section [4. 5 1 Consequently, the algorithm is performed using the follow¬ 
ing steps: 

1. Compute covariance matrix of matrix X: S (see equations (I21L and 

m)- 

2. Compute determinant of the covariance matrix S: det{S). 

3. Compute inverse of the covariance matrix E: E~^. 

4. Compute the approximate value of the bandwidth (see equation (1^ 1. 

5. Determine the range in which we search for a minimum of the objective 
function g{h) (see equation (l29l) l. 

6. Compute for all = Xi — Xj such that i = and j = 

i + 1,... ,n (see equation dMI). 

7. Search for minimum of g{h) objective function within the range computed 
previously (step 5). Each time objective is computed, its modified version 
(see equation (H51) l should be used, which can be computed based on pre¬ 
computed values of the S{v) function. 

Steps [T] to [5] of the algorithm in all implementations are performed sequen¬ 
tially on CPU without using SSE instructions. Though computing of covariance 
matrix could be performed easily in parallel by using an algorithm similar to the 
one presented in section [575] we have not implemented this as this step takes very 
little time when compared to the last steps of the LSCV_h algorithm presented 
above. Values of S(v) function are precomputed using the algorithm presented 
in section [5.51 The last step (minimization of g{h) function) is performed by a 
“brute force” search for a minimum on a grid, where the density of a grid is 
based on a user specified parameter and should be sufficiently dense. Note that 
as was stated in section OI other approaches to minimization of g{h) objective 
function are also possible. In this paper however we present implementations of 
the “brute force” method. 


GPU implementation In GPU implementation, searching for minimum of 
g{h) is performed as follows. First, based on the grid density and the width 
of the range Z{ho) (equation (05)) 1 the number of different h values to be 
tested (denoted Uh) are determined. Next, the number of one dimensional 
blocks needed to perform reduction of the nested sums in equation (H5I) (de¬ 
noted n}y) are determined. Given the block size bs this may be computed as: 
TT-b = \{n{n— l)/2)/(26s)]. Given these values, a two dimensional computing 
grid is started, where each row corresponds to a single tested h value (conse¬ 
quently there are nh rows). Each row is composed of Ub one dimensional blocks. 
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Parallel computation of g{h) values for all tested arguments is performed simi¬ 
larly as in algorithm presented in section [5.31 The main differences here are as 
follows: 

— Each thread based on the y coordinate of its block within a computation 
grid computes tested argument h (its not retrieved from the memory). 

— Reduction is performed on the precomputed S{v) values. For each retrieved 
S{v) value and based on the grid row dependent h argument value, the 
function T{v) is computed and these computed values are then added. 

— Reduction is performed independently in each row of the computational grid, 
i.e. each row reduces the same set of S{v) values but for different h argument 
value. 

Similarly as in section 15.31 reduction is performed in each block, so each 
started block yields one reduced value. An Uh x matrix of values obtained 
in this way is stored in global memory. As we expect a single value per row, 
not a single value per block, subsequent reduction is performed in each row 
independently. This process is repeated until only a single value per grid row is 
obtained. The obtained values represent nested sums from the equation (1^ . To 
find the final g{h) values, for each value obtained during reduction step, a single 
thread is started, which performs remaining operations of the equation (1431) . 
Computed g{h) values are copied to the computers memory and the argument 
for which the function g{h) has minimal value is found. Notice that the last 
operation could also be perfomed in parallel on GPU. However, this step can be 
performed in negligible time so there is no need to accelerate it. 

Now we would like to address a problem that comes from GPU limitations. 
The computation grid can have no more than 65535 rows and columns. Con¬ 
sequently, if Ufi or rib cross this boundary, the corresponding computation grid 
cannot be started. Solving the problem with too big rib value is pretty simple. 
If rib is bigger than 65535, only 65535 blocks per grid row are started. The al¬ 
gorithm for reduction of values presented in [T^ adds more than two values 
per thread if there are more values to be reduced, than two times the available 
number of threads. If rih is bigger than 65535 the set of h values to be tested 
is divided into portions of size 65535 and the computations described above are 
performed sequentially for each such portion. 


SSE implementation As SSE implementation has much less threads to uti¬ 
lize, we have used a simpler approach. Similarly, as in GPU implementation we 
first determine the value nh- In contrast to GPU approach however, we do not 
compute g{h) function for each h argument in parallel. This is performed sequen¬ 
tially in a loop, and only computation of a single g{h) function is parallelized. To 
perform nested sums from the equation (l4^ . we use the algorithm presented in 
sectionThis algorithm is started on the precomputed set of S{v) values and 
for each such value the function T[v) is computed and subsequently reduced into 
a single value. Based on this value, the final value of g{h) function is computed. 
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The loop keeps track of the lowest g{h) value found up-to-date (and correspond¬ 
ing h value) and after the loop is finished the result of the LSCV_h algorithm is 
known. 


6.3 LSCV_H 

The LSCV_H algorithm can use any numerical function minimization algorithm 
capable of minimizing g{H) function (see equation (1321) 1. Such algorithms are 
often inherently sequential as they are based on iterative improvement of results 
of previous iteration. Consequently, the only parallel approach to such algorithms 
would require to start multiple parallel instances of this algorithm, each starting 
from a different starting point in hope of finding better result after a fixed number 
of steps. However, the number of steps needed to converge to a local optimum 
cannot be reduced. Possibly, for some specific algorithms, some steps could be 
parallelized, but that is not a subject of this paper. Still, there is one thing that 
can be improved here. Notice, that an iterative optimization algorithm needs to 
compute objective function at least once per iteration to assess currently found 
solution. Our objective function g{H) can take a long time to compute, and 
while it is computed, other optimization algorithm steps cannot be processed. 
Consequently, the time of finding the optimal H matrix can be improved if we 
optimize computing of g{H) function. Unfortunately, we cannot use a set of 
precomputed values like the ones used to speed up the LSCV_h algorithm, but 
we can adapt the algorithm used to compute those values. Compare exponents 
in equations dMl) and (1411) (algorithm LSCV_h) with exponents in equations 
dMl) and (IMl) (algorithm LSCV_H). Both exponents are computed similarly. 
The main difference is that in LSCV_h algorithm the matrix is constant 
and exponents may be precomputed, while the corresponding H~^ matrix in 
LSCV_H algorithm is not constant. Consequently, to compute a single value of 
g{H) both steps: computing exponents and reducing T function value have to 
be performed. To make this solution a little bit more cache friendly, we combine 
both: exponent finding algorithm described in section 15.51 and function value 
reduction algorithm presented in section 15.31 into one algorithm. 


GPU implementation Let us start with describing the modification of the 
GPU algorithm. Recall the algorithm description from section 15.51 The process¬ 
ing in this algorithm is done in tiles: 256 threads process 256 combinations of two 
vectors from the matrix X. Each tile is represented as a single thread block. We 
adapt this algorithm by adding additional steps for each thread, after it finishes 
computing of its corresponding value. Based on this value, T[H) function (see 
equation (1551) 1 is computed and the result is stored in the additional buffer in the 
shared memory. Next, all threads within a block are synchronized and then per¬ 
form parallel reduction algorithm on the obtained values. Consequently, after all 
threads within a block finish processing a tile, a single value, which constitutes 
partially performed nested sums from the equation (1321) is obtained. This value 
is then stored in global memory for further reduction by using the algorithm 
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from section [5.21 The computed sum is then used to finalize computation of the 
g{H) function. 


SSE implementation Modifications of the SSE algorithm are pretty similar. 
Recall that here each 16 value row processed is implemented by four __ml28 
variables. Consequently, each computed row of a processed tile is returned in 
this form. After such row is computed, it is not stored in the resulting array, but 
SIMD implementation of the T{H) function is computed on each of these four 
variables. Next, all of these variables are added to a single __ml28 accumulator. 
This process is repeated for all rows computed during processing of a tile. Finally, 
the four values stored in the accumulator are added to obtain a single value 
which constitutes partially performed nested sums from the equation ([32]). These 
values are then stored into an array for further reduction by using the algorithm 
from section 15.21 Computed sum is then used to finalize computing of the g{H) 
function. 


7 Experiments 

7.1 Environment and implementation versions 

For the purpose of experiments we have implemented PLUGIN, LSCV_h and 
LSCV_H algorithms (see section El), each in 3 versions: Sequential implementa¬ 
tion, SSE implementation and GPU implementation. LSCV_H implementations 
only implemented computing of the g{H) objective function, as this is the only 
element of this algorithm that has influence on its performance. We have also 
made a small change in LSCV_h algorithm. To make performance of this algo¬ 
rithm completely data independent, we have slightly modified the last part of 
the algorithm where we search for a minimum of the objective function on a grid. 
The objective function is computed for a fixed number of points (150) within 
the computed interval as opposed to section 14.41 where only distance between 
the tested points is specified, and their number is dependent on the width of the 
computed interval. In all versions we have used ALGLIB library to perform 
matrix square root and Horner’s method for accelerating calculations of poly¬ 
nomial values (in portions of equations (HSI and (UHl)). Each implementation 
uses single precision arithmetic. As was stated in section 13.21 single precision 
arithmetic is much more efficient on GPUs than double precision. Notheless, 
performance of double precision computations on GPUs grow with each new 
graphics card and should be adequate in the near future. Only simple modifica¬ 
tions to the presented solutions, which take into account mainly efficient access 
to shared memory, need to be made in order to accomodate double precision. 
To make the results of performance tests of CPU and GPU comparable we have 
decided that the Sequential and SSE implementations should also utilize single 
precision. Still one should notice that both of these implementations could uti¬ 
lize double precision as well. For sequential implementations changes are trivial. 
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SSE implementation could be changed in two ways to accomodate double pre¬ 
cision: either operate on 128bit vectors holding two double precision instead of 
four single precision values (utilize SSE2 instructions) or operate on 256bit vec¬ 
tors containing four double precision values (utilize AVX instructions and new 
registers). Below we give a short description of each of the versions. 

— Sequential implementation - a “pure C” straightforward sequential im¬ 
plementation of formulas presented in sections 14.41 (PLUGIN algorithm), 
14.41 and 14.51 (LSCV.h algorithm) algorithm, and 14.41 (LSCV_H algorithm). 
This implementation does not take into account any knowledge about the 
environment it is going to be executed in. It is not cache aware. The only 
cache awarness in this implementation is reflected in loops construction and 
memory storage which avoids non linear memory accesses (up to transposing 
matrices if necessary). This implementation does not use any SSE instruc¬ 
tions and is single threaded. Reductions are performed only using sequential 
implementation of the hierarchichal algorithm presented in section 15.31 

— GPU implementation - implementation that accelerates computations 
using CUBA API to execute computations on a GPU. This implementation 
is highly parallel and uses thousands of threads. Implementation tries to 
utilize multiple GPU memory types, including very fast shared memory, 
which may be treated as a user programmable cache. Implementation also 
uses G-l—I- templates to automatically create several versions of each gpu- 
kernel with unrolled loops. For details see section l6T] IPLUGIN algorithm), 
16.21 ILSCV-h algorithm) and 16.31 ILSCV-H algorithm). 

— SSE implementation - implementation that tries to utilize all ways of ac¬ 
celerating performance available on CPUs, i.e. it utilizes SSE instructions, 
multiple threads (to utilize multiple CPU cores and HyperThreading capabil¬ 
ities) and is cache aware. OpenMP [6] is used for thread management. Imple¬ 
mentation also uses C-I--I- templates to automatically create several versions 
of each function with unrolled loops. For details see section [01 (PLUGIN 
algorithm), inm (LSCVJr algorithm) and 16.31 1LSCV_H algorithm). 

Experiments were performed on a computer with Intel Core 17 CPU work¬ 
ing at 2.8GHz, NVIDIA GeForce 480GTX graphics card and 8GB of DDR3 
RAM. The CPU supports SSE4 instruction set, has 4 cores and HyperThread¬ 
ing capability. The GPU has 15 multiprocessors, each composed of 32 streaming 
procesors. All implementations were run under Linux operating system (Arch 
Linux distribution, 3.3.7 kernel version). 

All of the tested implementations may be downloaded from 
http://www.cs.put.poznan.pl/wandrzejewski/_sources/hcode.zip, 

7.2 Experiments and datasets 

We have performed several experiments testing the influence of the number of 
samples (n) and their dimensionality (d) on the performance of all of the imple¬ 
mentations introduced in section ITm The input sample sizes and dimensionalities 
for each of the algorithms were as follows: 
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— PLUGIN algorithm: n = 1024,2048,32768, d = 1 (PLUGIN algorithm 

only supports one dimensional data). 

— LSCV_h algorithm: n = 64,128,..., 1024, d = 1,..., 16. 

— LSGV.H algorithm: n = 1024, 2048,..., 16384, d = 1,..., 16. 

All processing times measured for GPU implementations include data trans¬ 
fer times between GPU and GPU. 

From the obtained processing times we have calculated speedups achieved 
for each algorithm and its implementation: 

— SSE over Sequential implementation, 

— GPU over Sequential implementation, 

— GPU over SSE implementation. 

The results are presented in the next section. 

As the input data does not influence processing times of tested implementa¬ 
tions (except maybe LSGV_H method where the performance of a minimization 
algorithm depends on the dataset, but in our implementation we skip this part), 
we supply random dataset for each algorithm/sample size/dimensionality size 
combination. In other words, each implementation of an algorithm is tested in 
the same conditions. 

7.3 Experimental results 

Let take a look at Figure Ha] It presents speedups of GPU PLUGIN algorithm 
implementation over Sequential implementation. As can be noticed, GPU imple¬ 
mentation is about 500 times faster than sequential implementations. Moreover, 
notice that the bigger the instance is the greater speedup is obtained. Figure 
[8b] presents speedups of SSE PLUGIN algorithm implementation over Sequen¬ 
tial implementation. The obtained speedups are about 32. Notice, that similarly 
as with GPU implementation, the obtained speedup grows as the size of the 
instance increases. Figure |8c] presents speedups of SSE PLUGIN algorithm im¬ 
plementation over Sequential implementation. The maximum obtained speedup 
is about 16 and the speedup grows as the size of the instance increases. 

First, let us explain the observed increase in obtained speedups. PLUGIN 
algorithm has the complexity of O(n^) (the most complex part is computing 
double sums in equations (1161) and (1181) 1. Gonsequently, for each implementation 
there exists a second order polynomial which can be used to determine the time 
needed to process an instance of size n. Speedup is calculated by dividing the 
processing times of the slower implementation by the processing times achieved 
by the faster implementation. Let F{n) = afn^ + bfu + Cf be the polynomial 
that determines the processing time of the faster implementation and S{n) = 
agU^ + bsU + Cs be the polynomial that determines the processing time of the 
slower implementation. Speedup is calculated as follows: 

S{n) asU^ + bgU + Cs 
F{n) afn'^ + bfn + Cf 


Speedup{n) 


(61) 
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Speedup of GPU PLUGIN implementation over Sequential PLUGIN implementation 



(a) GPU implementation speedup over Sequential implementation 


Speedup of SSE PLUGIN implementation over Sequential PLUGIN implementation 



(b) SSE implementation speedup over Sequential implementation 


Speedup of GPU PLUGIN implementation over SSE PLUGIN implementation 



(c) GPU implementation speedup over SSE implementation 


Fig. 8: Total speedups of all PLUGIN algorithm implementations 
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Notice that: 



(62) 


whereas: 


Zim. 


-n—^oo 


Speedup{n) = 


(63) 


It can be therefore determined that speedup starts from some value ^ and 
asymptotically approaches — for sufficiently big instances. Here — value is 
smaller than ^ and consequently the speedup increases with the number of 
samples n. Using this framework we can also determine the maximum speedup 
in the given hardware and software conditions, by first fitting a second order 
polynomial to the obtained processing times using least-squares measurement 
and then calculating limit using equation (I63|) . 

The speedup values shown in Figure I8bl can be explained as follows. The 
SSE algorithm uses SIMD instructions, performing the same operation on four 
different values. Moreover, a quad core processor with hyperthreading capabili¬ 
ties is used (sequential implementation uses only a single thread and hence only 
one core). Four cores times four times faster processing of data should lead to 
about 16 times speedup. Notice however, that due to HyperThreading capabil¬ 
ities, each core can process multiple threads more efficiently (we have used 8 
threads - two per core). Moreover, additional cache aware optimizations make 
the SSE implementation even faster as little time is wasted on data retrieval 
from RAM. 

The speedup of about 16 of GPU implementation over SSE implementation 
and 500 over sequential is hard to explain similarly as was done for the SSE over 
Sequential implementation speedup as both architectures are very different. 

Lets us now take a look at FigurelHl It presents speedups of GPU LSGVJi im¬ 
plementation and SSE implementation over Sequential implementation (Eigures 
[nil and |9b] respectively) and Speedup of GPU implementation over SSE imple¬ 
mentation (Figure |9c]). Each curve represents a different data dimensionality. 
Speedups achieved here are as follows: 

— GPU implementation is about 550 times faster than Sequential implemen¬ 
tation, 

— SSE implementation is about 20 times faster than Sequential implementa¬ 
tion, 

— GPU implementation is about 20 times faster than SSE implementation. 

Observations that can be made here are very similar to the ones made for PLU¬ 
GIN algorithm, as: 

— Algorithm’s complexity is 0{n?'{(P + Uh)) (see section H751) . Given the fact 
that d and Uh are constant, the complexity is reduced to O(n^), i.e. the same 
as complexity of the PLUGIN algorithm. 
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Speedup of GPU LSCV_h implementation over Sequentiai LSCV_h impiementation 

d=l — 
d=2 — 
d=3 — 
d=4 — 
d=5 — 
d=6 — 
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d=ll — 
d=12 — 
d=13 — 
d=14 — 
d=15 — 
d=16 — 



(a) GPU implementation speedup over Sequential implementation 

Speedup of SSE LSCV_h implementation over Sequentiai LSCV_h impiementation 
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(b) SSE implementation speedup over Sequential implementation 

Speedup of GPU LSCV_h implementation over SSE LSCV_h implementation 
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(c) GPU implementation speedup over SSE implementation 

Fig. 9: Total speedups of all LSCV_h algorithm implementations 
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— Most of the algorithm processing time constitutes of compnting of g{h) func¬ 
tion (see equation (1431) 1 which is performed by using data reduction algo¬ 
rithms (the same algorithms are used in PLUGIN implementations). 

One can also make several interesting observations. 

First, for each dimensionality the speedup limit is slightly different. This 
is due to the compiler optimizations in which loops are unrolled, and for each 
dimensionality the compiler creates a different version of used procedures. Con¬ 
sequently, each curve represents in fact a slightly different program. Moreover, 
the bigger the dimensionality, the bigger the speedup is, but for bigger dimen¬ 
sionalities the difference between them is smaller. This may be explained sim¬ 
ilarly as we have explained a similar fenomenon earlier (the one in which the 
speedup was increasing with respect to the number of samples n). Let us recall 
that LSCV_h algorithms complexity is 0{n?'{(P +nh))- For a specified constant 
values of n and Uh (arbitrarily set to 150), the complexity is reduced to 0{(P). 
Consequently, processing times may be determined by a second order polynomial 
where dimensionality is the independent variable. As speedup is calculated by 
dividing the processing times of the slower implementation by the times achieved 
by the faster implementation, it is basically a proportion of two second order 
polynomials, which has a limit at d —>■ oo. 

Second observation is that the curves in Figures and lilcl are not smooth. 
These disturbances are caused by SSE implementation which achieved for n = 
128 and to a lesser extent n = 384 a little bit worse performance then expected. 

Finally, let us analize Figure (TU] which presents speedups of GPU LSCV_H 
implementation and SSE implementation over Sequential implementation (Fig¬ 
ures and llObI respectively) and speedup of GPU implementation over SSE 
implementation (Figure [TO^). Similarly, as in LSCV_h algorithm we have tested 
processing times for data dimensionalities equal to d = 1,..., 16 and computed 
speedups based on the obtained values. Speedups achieved here are as follows: 

— GPU implementation is 290 times faster than Sequential implementation, 

— SSE implementation is 20 times faster than Sequential implementation, 

— GPU implementation is 10 times faster than SSE implementation. 

Most of the discussion for PLUGIN and LSCVJr algorithms presented earlier can 
also be used to explain results obtained for LSCV_H algorithm implementations. 
Recall that our LSCV_H implementation is not complete. We have only tested 
processing time needed to compute g{H) function (equation Let us deter¬ 

mine the complexity of computing g{H) function. The g{H) function contains 
double sums which add 0{n^) TniXi — Xj) function values. Gomputation of the 
TniXi — Xj) function requires 0{(P) operations. Gonsequently, computation or¬ 
der of g{H) function is 0{n^cP). This in turn leads to the earlier discussion of 
representing function processing times with second order polynomials with ei¬ 
ther n or d as independent variable, and the second variable constant. As was 
stated earlier, if processing times are expressed by second order polynomials, 
then speedup function (defined as ratio of two polynomials) is a function that 
has a limit at 0 and at oo. All of the observed values approach some asymptotic 
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value which is equal to the limit of the speedup function at the infinity boundary. 
Starting value (for low n or d values) is close to the speedup function limit at n 
(or d) tending to 0. Several other observations can be made. First, take a look 
at Figure ITOal Curves for dimensionalities d = 1,...,3 are different from the 
rest of the observed ones. Notice that here the speedup seems to drop with the 
increase in the number of samples (n). These observations however, though look 
differently than previously analysed on other figures, also fit our ealier discussion 
regarding limits of the speedup function defined as a ratio between two second 
order polynomials. Let us consider a situation where n = 16384. As n is constant, 
processing times can be determined by a second order polynomials dependent on 
d. For low values of d (d = 1,..., 3) we observe high speedups which descrease as 
d increases. Such observation means that probably the constant part of the poly¬ 
nomial is smaller for GPU implementation than for Sequential Implementation. 
However, in such a case it would mean that GPU implementation requires less 
initialization time than Sequential version. This is certainly not true, as GPU 
Implementation needs to perform several additional tasks, such as data transfer 
to device memory. This phenomenon may be explained as follows. Even though 
the number of dimensions is low, the sequential implementations outer loops are 
dependant on v? (they iterate over every combination of two samples from the 
dataset). This means that there is a constant (we assume n = 16384) processing 
time required for processing of those loops (branch prediction etc.). Low values 
of d mean that the inner loops, which compute equation (1331) . require less time 
and therefore constitute a lesser percentage of the whole algorithm processing 
time. The same situation does not influence GPU implementation as much, as 
outer loop iterations are performed in parallel. Gonsequently less time is wasted 
on -n? dependant loop processing costs. This in turn leads to conclusion that for 
high n and low d speedup is higher. Now, let us assume that n = 1024. Low 
values of n mean low outer loop processing costs and GPU typical initialization 
costs start to dominate, which leads to smaller speedups for low values of d. This 
can be also noticed in Figure flOal One could ask why the same phenomenon was 
not observed for LSGV_h algorithm. This stems from the fact that in LSGV_h 
algorithm the values of equation (which is equation’s (1551) counterpart) are 
computed only once, and other algorithm’s tasks dominate. 

Finally, in Table [51 for illustrative purposes only, we present comparison of 
processing times obtained for largest instances in each of the experiments. 


Table 3: Processing times for largest instances in experiments 


Algorithm (instance size) 

Implementation [ms] 

GPU 

SSE 

Sequential 

PLUGIN (n = 32768) 

87.9 

1442.3 

47435.3 

LSCVJi (n = 1024, d = 16) 

14.7 

344.1 

8283.6 

LSCVJH (n = 16384, d = 16) 

184.2 

2320 

53258.8 
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Speedup of GPU LSCV_H implementation over Sequentiai LSCV_H impiementation 
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(a) GPU implementation speedup over Sequential implementation 



Speedup of SSE LSCV_H implementation over Sequentiai LSCV_H impiementation 



Number of samples 

(b) SSE implementation speedup over Sequential implementation 


Speedup of GPU LSCV_H implementation over SSE LSCV_H implementation 
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(c) GPU implementation speedup over SSE implementation 



Fig. 10: Total speedups of all LSCV_H algorithm implementations 
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8 Conclusion 

In the paper we have presented some methods of how to efficiently compute 
the so called bandwidth parameter used in computing kernel probability density 
functions (KPDF). The functions can be potentially used in various database 
and data exploration tasks. One possible application is the task known as ap¬ 
proximate query processing. However, the serious drawback of the KPDF ap¬ 
proach is that computations of the bandwidth parameter -- a crucial parameter 
in KPDF are very time consuming. To solve this problem we have investigated 
several methods of optimizing these computations. We utilized two SIMD archi¬ 
tectures: SSE CPU architecture and NVIDIA GPU architecture to accelerate 
computations needed to hnd the optimal value of the bandwidth parameter. 
We have tested our SSE and GPU implementations using three classical algo¬ 
rithms for finding the optimal bandwidth parameter: PLUGIN, LSCV_h and 
LSCV_H. Detailed mathematical formulas are presented in section 14.41 As for 
LSCV_h algorithm we have proposed some slight but important changes in the 
basic mathematical formulas. The changes allow us to precompute some values 
which may be later reused many times. The details can be found in section 
Q] The fast SSE and CUDA implementations are also compared with a sim¬ 
ple sequential one. All the necessary details on using SIMD architectures for 
fast computation of the bandwidth parameter are presented in section [5] and 
the final notes on how to utilize the algorithms are presented in section [51 Our 
GPU implementations were about 300-500 times faster than their sequential 
counterparts and 12-23 times faster than their SSE counterparts. SSE imple¬ 
mentations were about 20-30 times faster than sequential implementations. The 
above results confirm the great usability of modern processing units. All the 
codes developed have been made public available and can be downloaded from 
http://WWW.cs.put.poznan.pl/wandrzej ewski/_sources/hcode.zip. 
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A Derivation of equations (H^ and flSU]) 

In this appendix we provide detailed derivation of equations (l4^ and (l50l) used 
for calculating thread block’s position within an upper triangular matrix. We 
assume that the thread blocks are started within a one dimensional grid and are 
assigned unique non negative consecutive numbers bx (see Figure!?]). Based on 
bx we want to compute block’s column (denoted 1) and block’s row (denoted q) 
within an upper triangular matrix. 

We shall start with deriving the number of blocks Ub contained in triangular 
matrix. As each consecutive column has one more block than the previous one 
(i.e. column number I contains I +1 blocks) the number of blocks up to a column 
number I (i.e. n = I + 1 columns) may be calculated as a sum of arithmetic 
progression: 


(64) 

(65) 

( 66 ) 


ai = 1, 

CLjl — /-)-!, 

Tib — Ti(^ai -\- Oby^jl = (/-)- !)(/ -I- 


Let us now assume that we have a given bx number of a block and we know that 
this block is on the main diagonal of the triangular matrix. We will now find the 
column number I for this block. Let us consider a submatrix of the triangular 
matrix such that the &a:-th block is in the lower right corner of this submatrix. 
As bx numbers are zero-based we know that there must he rib = bx + 1 blocks 
in this sumbatrix. Let us substitute this to the equation (1661) : 
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(/ + 1)(^ + 2)/2 — bx + 1, 
f + 3l- 2bx = 0. 


(67) 

( 68 ) 


We solve the quadratic equation (1^ : 


A = y/8bx + 9, 


(69) 

(70) 

(71) 


h = {-V8bx + 9-3)/2, 


h = (\/86x + 9-3)/2. 


As the result of equation (EOl) is always negative the column number we are 
searching for is given by equation CD. 

Let us now assume that bx block is not on the main diagonal. Let bxi be 
the biggest block number such that bxi < bx and let bx 2 be the smallest block 
number such that bx < bx 2 - Let Ibxi be the column of bxi block and Ibxi be the 
column of bx2 block. It is easy to notic, that hxi + 1 = ■ Consequently, the I2 

column calculated for bx block using equation CD must be a non integer value 
hxi < h < hx2- Moreover, as bxi is the number of the last block in its column, 
bx block must be in the same column as bx 2 block, i.e. its column number is the 
smallest integer value greater or equal to I2 calculated using equation (1711) . This 
leads to the equation (H^ : 


\/ 8bx + 9 — 3 
2 


To find block’s row number q, we need to subtract from bx the number of 
blocks in previous columns. This can be easily found by substituting I — 1 into 
equation (1551) . which leads to the final equation (1501) : 




q = bx 


2 










